Troubleshooting AWS CloudFormation

Project Overview

In this project, I practiced troubleshooting AWS CloudFormation deployments through a series of tasks while building a proof of concept for the café's deployment strategy using Infrastructure as Code (IaC).

Initial Setup and JMESPath Practice

I started by practicing JMESPath queries at jmespath.org to prepare for working with JSON-formatted data. This was really helpful since I knew I'd be using similar queries with the AWS CLI later.

First, I copied this JSON document to work with:

{
  "desserts": [
    {
      "name": "Chocolate cake",
      "price": "20.00"
    },
    {
      "name": "Ice cream",
      "price": "15.00"
    },
    {
      "name": "Carrot cake",
      "price": "22.00"
    }
  ]
}

I tried several queries against this document to get familiar with the syntax.

Then I practiced with AWS resource data:

{
    "StackResources": [
        {
            "LogicalResourceId": "VPC",
            "ResourceType": "AWS::EC2::VPC"
        },
        {
            "LogicalResourceId": "PublicSubnet1",
            "ResourceType": "AWS::EC2::Subnet"
        },
        {
            "LogicalResourceId": "CliHostInstance",
            "ResourceType": "AWS::EC2::Instance"
        }
    ]
}

I figured out that to get the LogicalResourceId of the EC2 instance, I needed to use:
StackResources[?ResourceType == 'AWS::EC2::Instance'].LogicalResourceId

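During this work, I was encouraged to practice JMESPath expressions throughout, especially with the AWS CLI's --query parameter, which filters command output on the client side. (The separate --filters parameter on commands such as aws ec2 describe-instances is not JMESPath; it narrows results on the server side.) This proved extremely valuable for trimming the output of AWS CLI commands.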
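As a sketch of how the practice query carries over to the CLI, the same expression can be passed to --query. The function name logical_ids_by_type is hypothetical and only wraps the describe-stack-resources command used later in this write-up; the expression sits in double quotes so that the single-quoted JMESPath string literal survives the shell.

```shell
# Hypothetical helper: list the logical IDs of a stack's resources of one type.
# $1 = stack name, $2 = resource type (for example AWS::EC2::Instance)
logical_ids_by_type() {
  aws cloudformation describe-stack-resources \
    --stack-name "$1" \
    --query "StackResources[?ResourceType == '$2'].LogicalResourceId" \
    --output text
}

# Example use (requires a configured AWS CLI and an existing stack):
#   logical_ids_by_type myStack AWS::EC2::Instance
```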

Connecting to CLI Host and Configuring AWS CLI

After understanding JMESPath basics, I moved on to establishing an SSH connection to the CLI Host instance. Once connected, I determined which region I was working in:

curl http://169.254.169.254/latest/dynamic/instance-identity/document | grep region
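To reuse the region in later commands rather than reading it off the screen, the value can be pulled out of the identity document. This is a minimal sketch that assumes the document keeps its usual one-key-per-line JSON layout; the function name extract_region is hypothetical.

```shell
# Hypothetical helper: read an instance identity document on stdin and
# print the value of its "region" key.
extract_region() {
  sed -n 's/.*"region"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Example use on the CLI Host (same metadata URL as above):
#   region=$(curl -s http://169.254.169.254/latest/dynamic/instance-identity/document | extract_region)
```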

Then I configured the AWS CLI with my credentials:

aws configure

I entered my AccessKey, SecretKey, region name (matching where my EC2 instance was running), and set the output format to JSON.

The lab environment provided an Amazon EC2 instance named CLI Host, which already existed in a VPC named VPC2. From this instance, I could use the AWS CLI to run CloudFormation commands.

First Stack Creation Attempt

I examined the CloudFormation template that I'd be working with:

less template1.yaml

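The template was quite comprehensive. Judging by the resources that appear in later steps, it defined at least a web server EC2 instance with a userdata script, a WaitCondition, the WebServerSG security group, and the MyBucket S3 bucket, and it accepted a KeyName parameter.

While viewing the file with the `less` command, I pressed RETURN (ENTER) to scroll through the contents and q to quit.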

I then tried to create my first stack:

aws cloudformation create-stack \
--stack-name myStack \
--template-body file://template1.yaml \
--capabilities CAPABILITY_NAMED_IAM \
--parameters ParameterKey=KeyName,ParameterValue=vockey

To monitor the creation process, I ran:

watch -n 5 -d \
aws cloudformation describe-stack-resources \
--stack-name myStack \
--query 'StackResources[*].[ResourceType,ResourceStatus]' \
--output table

In this command, the watch Linux utility runs the same command every 5 seconds and briefly highlights changes as they occur. The --output table parameter makes reading the results easier.

After watching for a few minutes, I noticed something strange - resources were being created but then suddenly started being deleted. Clearly something was wrong! I pressed CTRL+C to exit the watch utility and ran:

watch -n 5 -d \
aws cloudformation describe-stacks \
--stack-name myStack \
--output table

The stack status showed CREATE_FAILED and then went to ROLLBACK_IN_PROGRESS followed by ROLLBACK_COMPLETE. This confirmed that CloudFormation was automatically rolling back due to a failure.

This behavior was expected as part of the exercise.
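Instead of reading the full describe-stacks table, the status alone can be extracted and polled. The sketch below assumes the stack still exists while polling; the function names stack_status and wait_until_settled are hypothetical.

```shell
# Hypothetical helper: print only the status of a stack.
stack_status() {
  aws cloudformation describe-stacks \
    --stack-name "$1" \
    --query 'Stacks[0].StackStatus' \
    --output text
}

# Hypothetical helper: poll until the status no longer ends in _IN_PROGRESS,
# then print the final status. $2 = seconds between polls (default 5).
wait_until_settled() {
  while :; do
    status=$(stack_status "$1") || return 1
    case "$status" in
      *_IN_PROGRESS) sleep "${2:-5}" ;;
      *) printf '%s\n' "$status"; return 0 ;;
    esac
  done
}

# Example use (requires a configured AWS CLI):
#   wait_until_settled myStack
```

The CLI also provides aws cloudformation wait stack-create-complete, which blocks until creation succeeds and exits with a nonzero status otherwise. The loop above is only useful when the final status itself is of interest, as it is when watching a rollback.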

Diagnosing the Problem

To understand what went wrong, I checked for CREATE_FAILED events:

aws cloudformation describe-stack-events \
--stack-name myStack \
--query "StackEvents[?ResourceStatus == 'CREATE_FAILED']"

The output showed that the WaitCondition timed out. Since the WaitCondition was supposed to receive a signal from the userdata script on the EC2 instance, this suggested a problem with that script. Unfortunately, the automatic rollback had already deleted the EC2 instance, so I couldn't check its logs.

By default, AWS CloudFormation deletes all resources if any resource defined in the template cannot be successfully created. Because the wait condition resource failed, the entire stack failed and all changes were rolled back.
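The event query above returns every field of each failed event. A narrower projection, sketched here with a hypothetical function name, keeps only the resource and the reason it failed, which is usually what matters when diagnosing a rollback.

```shell
# Hypothetical helper: show which resources failed to create, and why.
failed_create_reasons() {
  aws cloudformation describe-stack-events \
    --stack-name "$1" \
    --query "StackEvents[?ResourceStatus == 'CREATE_FAILED'].[LogicalResourceId,ResourceStatusReason]" \
    --output table
}

# Example use (requires a configured AWS CLI):
#   failed_create_reasons myStack
```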

I verified the stack status was now ROLLBACK_COMPLETE:

aws cloudformation describe-stacks \
--stack-name myStack \
--output table

I deleted this failed stack:

aws cloudformation delete-stack --stack-name myStack

The stack was deleted quickly because the rollback had already removed its resources, so nothing was left to delete.

Preventing Rollback to Debug

For my second attempt, I decided to prevent CloudFormation from rolling back on failure so I could investigate:

aws cloudformation create-stack \
--stack-name myStack \
--template-body file://template1.yaml \
--capabilities CAPABILITY_NAMED_IAM \
--on-failure DO_NOTHING \
--parameters ParameterKey=KeyName,ParameterValue=vockey

I monitored the creation again:

watch -n 5 -d \
aws cloudformation describe-stack-resources \
--stack-name myStack \
--query 'StackResources[*].[ResourceType,ResourceStatus]' \
--output table

This time, when the WaitCondition failed, the other resources remained in CREATE_COMPLETE status instead of being deleted. Perfect!

To exit the watch utility, I pressed CTRL+C.

I confirmed the stack was in CREATE_FAILED status:

aws cloudformation describe-stacks \
--stack-name myStack \
--output table

I verified that the same WaitCondition timeout was the issue:

aws cloudformation describe-stack-events \
--stack-name myStack \
--query "StackEvents[?ResourceStatus == 'CREATE_FAILED']"

Investigating the EC2 Instance

Now I could connect to the Web Server EC2 instance to see what went wrong. First, I needed its IP address:

aws ec2 describe-instances \
--filters "Name=tag:Name,Values='Web Server'" \
--query 'Reservations[].Instances[].[State.Name,PublicIpAddress]'
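If an instance from an earlier, rolled-back attempt is still listed as terminated, the query above can return more than one row. A possible refinement, sketched with a hypothetical function name, adds the instance-state-name filter so that only running instances are considered and prints just the address.

```shell
# Hypothetical helper: print the public IP of running instances whose
# Name tag matches $1.
public_ip_by_name() {
  aws ec2 describe-instances \
    --filters "Name=tag:Name,Values='$1'" "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[].PublicIpAddress' \
    --output text
}

# Example use (requires a configured AWS CLI):
#   public_ip_by_name "Web Server"
```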

I opened a new terminal window and connected to the Web Server via SSH. Once connected, I examined the cloud-init-output.log:

tail -50 /var/log/cloud-init-output.log

I noticed two critical errors:

  1. "No package http available"
  2. "util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-001"

Next, I looked at the part-001 script to see what went wrong:

sudo cat /var/lib/cloud/instance/scripts/part-001

I noticed the script had a #! line with the -e parameter, which meant it would immediately stop if any command failed. The issue was clear - it was trying to install a package called "http" which didn't exist (it should have been "httpd" for the Apache web server).

In summary, because no package named http could be found, the userdata script failed. Therefore, the wait condition never received the success signal, and after 2 minutes, the wait condition timed out. This was why the stack failed.
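The effect of the -e parameter can be reproduced locally without any AWS resources. In this small demonstration, the false command stands in for the failed package install and the final echo stands in for the step that would send the success signal; both stand-ins are illustrative and are not lines from the actual userdata script.

```shell
# Run a two-step script under "bash -e". The middle command fails on purpose,
# so the script stops there and the last echo never runs.
output=$(bash -e -c 'echo "step 1: install package"; false; echo "step 2: send success signal"') \
  && status=0 || status=$?

echo "output: $output"        # prints "output: step 1: install package"
echo "exit status: $status"   # prints "exit status: 1"
```

Because the script aborts at the first failure, the signalling step is never reached, which matches the timeout seen on the WaitCondition.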

I disconnected from the Web Server instance and closed that terminal window.

Fixing the Template

Back on the CLI Host, I edited the template to fix the issue:

vim template1.yaml

I navigated to line 128 and changed "http" to "httpd". In vim, I used the UP and DOWN arrow keys to position the cursor, pressed "a" to enter insert mode, made the change, pressed ESC to return to normal mode, and entered ":wq" followed by ENTER to write the change and quit.

I then verified the fix:

grep httpd template1.yaml

If the yum line had not appeared in the output, my change would most likely not have been saved.

I deleted the failed stack:

aws cloudformation delete-stack --stack-name myStack

I monitored the deletion with:

watch -n 5 -d \
aws cloudformation describe-stacks \
--stack-name myStack \
--output table

Once the deletion was complete, I created a new stack with the corrected template:

aws cloudformation create-stack \
--stack-name myStack \
--template-body file://template1.yaml \
--capabilities CAPABILITY_NAMED_IAM \
--on-failure DO_NOTHING \
--parameters ParameterKey=KeyName,ParameterValue=vockey

I monitored the creation:

watch -n 5 -d \
aws cloudformation describe-stack-resources \
--stack-name myStack \
--query 'StackResources[*].[ResourceType,ResourceStatus]' \
--output table

This time, all resources were created successfully! I confirmed with:

aws cloudformation describe-stacks \
--stack-name myStack \
--output table

The stack showed CREATE_COMPLETE, and I could see the PublicIP of the web server and the S3 bucket name in the Outputs section.

I tested the web server by opening the public IP in my browser and saw the "Hello from your web server!" message.

I figured out why the stack was failing, found the root cause in the log files on the EC2 instance, and updated the CloudFormation template so that the resources were created successfully. Combining the WaitCondition with the -e parameter in the userdata script means the stack reaches CREATE_COMPLETE only if every command in the script succeeds.

Detecting Drift

Next, I wanted to explore CloudFormation's drift detection, so I manually modified a resource:

  1. I opened the AWS Management Console
  2. Navigated to EC2 → Instances → Web Server → Security tab
  3. Clicked on the WebServerSG security group
  4. Edited the inbound rules
  5. Changed the SSH rule's source from 0.0.0.0/0 to My IP
  6. Saved the changes

I arranged the AWS Management Console tab so that it displayed alongside the instructions, which made the steps easier to follow.

I also added an object to the S3 bucket created by CloudFormation:

bucketName=$(\
aws cloudformation describe-stacks \
--stack-name myStack \
--query "Stacks[*].Outputs[?OutputKey == 'BucketName'].[OutputValue]" \
--output text)
echo "bucketName = $bucketName"

touch myfile
aws s3 cp myfile "s3://$bucketName/"
aws s3 ls "s3://$bucketName/"

To detect these changes, I ran:

aws cloudformation detect-stack-drift --stack-name myStack

I monitored the drift detection status, passing the StackDriftDetectionId value returned by the previous command:

aws cloudformation describe-stack-drift-detection-status \
--stack-drift-detection-id <StackDriftDetectionId>

The output showed "StackDriftStatus": "DRIFTED", indicating that at least one resource had changed.
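The detection ID does not have to be copied by hand. As a sketch, it can be captured from detect-stack-drift and passed straight to the status command; the function names below are hypothetical.

```shell
# Hypothetical helper: start drift detection and print only the detection ID.
start_drift_detection() {
  aws cloudformation detect-stack-drift \
    --stack-name "$1" \
    --query 'StackDriftDetectionId' \
    --output text
}

# Hypothetical helper: print the detection status and the stack drift status
# for a given detection ID.
drift_detection_status() {
  aws cloudformation describe-stack-drift-detection-status \
    --stack-drift-detection-id "$1" \
    --query '[DetectionStatus,StackDriftStatus]' \
    --output text
}

# Example use (requires a configured AWS CLI):
#   driftId=$(start_drift_detection myStack)
#   drift_detection_status "$driftId"
```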

I examined which resources had drifted:

aws cloudformation describe-stack-resources \
--stack-name myStack \
--query 'StackResources[*].[ResourceType,ResourceStatus,DriftInformation.StackResourceDriftStatus]' \
--output table

As expected, the security group showed MODIFIED status, but interestingly, the S3 bucket still showed IN_SYNC. This is because adding files to a bucket doesn't register as drift in CloudFormation (only property changes do).

I looked at the specific details of the security group drift:

aws cloudformation describe-stack-resource-drifts \
--stack-name myStack \
--stack-resource-drift-status-filters MODIFIED

The PropertyDifferences section showed that port 22 was now open only to my IP address instead of 0.0.0.0/0.
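To see only the differences rather than the full drift record, the output can be projected down to the PropertyDifferences entries. This is a sketch with a hypothetical function name; the four fields it selects are the ones each PropertyDifferences entry carries.

```shell
# Hypothetical helper: tabulate what changed on each MODIFIED resource.
property_differences() {
  aws cloudformation describe-stack-resource-drifts \
    --stack-name "$1" \
    --stack-resource-drift-status-filters MODIFIED \
    --query 'StackResourceDrifts[].PropertyDifferences[].[PropertyPath,DifferenceType,ExpectedValue,ActualValue]' \
    --output table
}

# Example use (requires a configured AWS CLI):
#   property_differences myStack
```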

I tried updating the stack to see if it would resolve the drift:

aws cloudformation update-stack \
--stack-name myStack \
--template-body file://template1.yaml \
--parameters ParameterKey=KeyName,ParameterValue=vockey

As expected, this did not resolve the drift. The update-stack command does not automatically reconcile drifted resources, so manual intervention is required to eliminate the drift.

Stack Deletion Challenge

Finally, I attempted to delete the stack:

aws cloudformation delete-stack --stack-name myStack

I monitored the deletion:

watch -n 5 -d \
aws cloudformation describe-stack-resources \
--stack-name myStack \
--query 'StackResources[*].[ResourceType,ResourceStatus]' \
--output table

Most resources were successfully deleted, but the S3 bucket showed DELETE_FAILED. I checked the stack status:

aws cloudformation describe-stacks \
--stack-name myStack \
--output table

The status showed "DELETE_FAILED" with the reason: "The following resource(s) failed to delete: [MyBucket]."

This made sense - CloudFormation cannot delete a bucket that still contains objects, which protects against accidental data loss.

One approach would be to manually delete or move the file from the S3 bucket and then run the delete-stack command again. However, this might not be appropriate if people in the organization have already started storing files in the bucket and other systems depend on the bucket name and location not changing.

For the challenge of keeping the bucket and its contents while successfully deleting the stack, I:

  1. Determined the logical ID of the S3 bucket:
    aws cloudformation describe-stack-resources \
    --stack-name myStack \
    --query "StackResources[?ResourceType=='AWS::S3::Bucket'].LogicalResourceId" \
    --output text
    This returned "MyBucket".
  2. Used the --retain-resources parameter with the delete-stack command:
    aws cloudformation delete-stack \
    --stack-name myStack \
    --retain-resources MyBucket
  3. Verified the bucket still existed with its contents:
    aws s3 ls s3://$bucketName/
  4. Confirmed the stack was completely deleted:
    aws cloudformation describe-stacks \
    --stack-name myStack
    This returned an error indicating the stack no longer existed.
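Steps 1 and 2 above can be chained so that the output of the first command feeds the second. The sketch below uses a hypothetical function name and assumes the stack is already in the DELETE_FAILED state, since --retain-resources applies only to stacks in that state.

```shell
# Hypothetical helper: retry deletion of a stack in DELETE_FAILED state while
# keeping every S3 bucket it created.
delete_stack_keep_buckets() {
  bucket_ids=$(aws cloudformation describe-stack-resources \
    --stack-name "$1" \
    --query "StackResources[?ResourceType=='AWS::S3::Bucket'].LogicalResourceId" \
    --output text) || return 1

  # Stop if the stack has no buckets; --retain-resources needs at least one ID.
  [ -n "$bucket_ids" ] || { echo "no S3 buckets found in stack $1" >&2; return 1; }

  # Left unquoted on purpose: text output separates multiple IDs with
  # whitespace, and each ID must reach the CLI as its own argument.
  aws cloudformation delete-stack \
    --stack-name "$1" \
    --retain-resources $bucket_ids
}

# Example use (requires a configured AWS CLI):
#   delete_stack_keep_buckets myStack
```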

Tips for the challenge:

  1. I made use of the AWS CLI Command Reference to find options for keeping resources created by the stack when running the delete-stack command.
  2. I determined the logical ID of the S3 bucket using a JMESPath expression in a --query parameter.
  3. I used the output from one command as input to another command to achieve success.

This approach successfully kept the S3 bucket and its content while completely removing the CloudFormation stack - perfect for scenarios where you want to preserve certain resources while removing the stack infrastructure.

Business Case Context

In this work, I created a deployment of a web server inside a custom VPC using an AWS CloudFormation template and learned how to troubleshoot deployments to understand the effectiveness of this technology.

Update from Café

I was excited to inform the business owners about the proof of concept work using CloudFormation. I explained how this "Infrastructure as Code" approach was causing me to completely rethink the café's current approach to managing the AWS resources that support the café website and other business applications.

They were very interested, and they discussed how they could use this technology to reliably create matching but separate development and production environments for new feature development. They also saw the value in features such as drift detection and the ability to tear down and rebuild complex cloud infrastructures consisting of resources from multiple AWS services.

Conclusion

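This hands-on troubleshooting experience with AWS CloudFormation taught me valuable lessons about querying CLI output with JMESPath, disabling rollback to keep failed resources available for debugging, tracing a WaitCondition timeout back to the userdata script through the cloud-init logs, detecting drift, and retaining resources when deleting a stack. With those steps done, the work objectives were successfully completed.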

I can see how this "Infrastructure as Code" approach would be incredibly useful for creating consistent cloud deployments, detecting unauthorized changes, and managing complex infrastructure reliably.
